[57] ABSTRACT

A data storage and retrieval system is provided which 
has extremely high capacity. The system includes a 
large array of small disk files, and three storage manag- 
ers for controlling the allocation of data to the array, 
access to data, and the power status of disk files within 
the array. The allocation manager chooses the disk files 
upon which incoming data is written based on the cur-
rent state of the disk files (active or inactive), the avail-
able capacity, and the type of protection desired (e.g.,
unprotected, RAID, mirrored, etc.). The access man-
ager interprets incoming read requests to determine the
location of the stored data. The power manager sequen- 
ces disk files between active and inactive to provide the 
storage requested by the access and allocation manag- 
ers. The power manager also maintains the disk array in 
conformance with thermal and power constraints to 
avoid excessive power consumption or thermal over- 
load while keeping active the optimal subset of the disk 
array based on the storage requests pending at any point 
in time. 

11 Claims, 8 Drawing Sheets 
HIGH CAPACITY DATA STORAGE SYSTEM USING DISK ARRAY

FIELD OF THE INVENTION

This invention relates in general to computer storage systems, and in
particular to a data storage and retrieval system which includes a
large number of disk drives and control mechanisms for handling data
storage, data retrieval, and power consumption to achieve extremely
high overall storage capacity.

BACKGROUND OF THE INVENTION

Ongoing development activities aimed at increasing computer processor
speed and data transmission rates, together with the increasing number
of applications requiring display, tabulation, synthesis, or
transformation of enormous amounts of data, are creating an
accelerating demand for high capacity data storage systems. Examples
include multimedia databases with full color images and audio, CAD/CAM
drawings and associated data, full text databases with access to daily
newspapers and other periodical information, and scientific data
involving empirical measurements or results from mathematical
calculations. Any of these applications can involve multiple terabytes
of data, [...] capacity data storage systems which use large amounts
of [...]

[...] data storage involves [...] and expensive control unit. In
recent years, the commoditization of small diameter (5¼ inches or
less) disk files has led to a revolution in high capacity data storage
through which the small groups of large diameter disk files have been
replaced by larger groups (typically 8 to 64 disk files) of small
diameter, inexpensive disk files. This metamorphosis has gained
additional impetus through the development of new techniques for
grouping small disk files that dramatically increase overall system
reliability while sharply decreasing data access times, floor space
requirements, and power consumption. These techniques have become
popularly known as RAID technology (redundant arrays of inexpensive
disks), which categorizes the trade-off between reliability and
redundancy using RAID levels. Thus, RAID-1 is used to indicate
mirrored storage, the most reliable but also the most costly
alternative. RAID-5 is used to indicate parity protected storage in
which parity is spread across a set of disk files. Other RAID levels
are available, and are discussed in detail in the literature.
Similarly, other data protection mechanisms are available which are
useful in grouping numerous small disk files.

Existing data storage systems built around RAID technology typically
include approximately 10 to 100 individual disks housed in one or more
racks, spaced from one another to allow cooling air to flow between
them. This approach has proven sufficient to date because the total
power and thermal loading created by 10 to 100 disks is readily
manageable, and because the total space occupied by such a system,
even including ample space between the disk files, is manageable.

However, existing packaging and power handling concepts are not
sufficient for use with denser and larger arrays required to store
high volume data. For example, at a typical power consumption rate of
50 mW/megabyte (M-byte), a 10 terabyte disk array would require 500 KW
of input power to service the disk files alone. This would create a
power requirement unacceptable to most users, and a cooling problem of
enormous proportions. Moreover, using today's technology tens of
thousands of small disk files are required to provide this level of
storage capacity. This would translate into a significant floor space
requirement which also is unacceptable to many users. And, any attempt
to reduce the spacing between individual disk files as a means of
reducing the overall floor space requirement would worsen the cooling
problem, rendering the data [...] meet all user criteria is a data
storage system having a large number (potentially tens of thousands)
of closely spaced small disk files with low power and thermal
requirements.

SUMMARY OF THE INVENTION

In accordance with the present invention a data storage and retrieval
system is provided which has extremely high capacity. The system
includes a large number of small disk files, and storage management
subsystems (storage managers) for controlling the power status of the
disk files and the allocation of data to the disk files [...]
management subsystem (configuration manager) included in the system
[...] disk [...] storage requirements. A cluster may consist of a
single disk [...] as RAID arrays.

Operation of the storage system is controlled by the storage managers.
The storage managers minimize internal thermal loading and power
consumption for the disk array by placing clusters in an inactive mode
when not in use. Subsequently, clusters may be placed in an active
mode when one or more of the storage managers determines that their
use may be required. One storage manager, known as an allocation
manager, chooses the clusters upon which incoming data is written
based on the current state of each cluster (active or inactive), the
remaining capacity in each cluster, and the type of protection desired
(e.g., unprotected, RAID, mirrored, etc.). The resulting allocation is
then provided to a second storage manager, known as a power manager.
Meanwhile, a third storage manager, known as an access manager,
interprets incoming read requests, determines the cluster location of
the stored data, and provides this information to the power manager.
The power manager collects incoming requests and sequences clusters
between active and inactive in an optimum manner consistent with the
power and thermal constraints of the data storage system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating the basic architecture for the
data storage system of the present invention.

FIG. 2 is a block diagram illustrating the system configuration
processing of the present invention.

FIG. 3 is a block diagram illustrating data flow for the data storage
system of the present invention.

FIG. 4 is a block diagram illustrating an exemplary physical layout
for low power applications.

FIG. 5 is a table illustrating the status of an exemplary low power
data storage system at a particular point in time.

FIG. 6 is a block diagram illustrating an exemplary 
physical layout for high capacity applications. 

FIG. 7 is a table illustrating a cluster mapping schema 
for an exemplary high capacity data storage system. 

FIG. 8 is a table illustrating the status of an exemplary 
high capacity data storage system at a particular point 
in time. 

DETAILED DESCRIPTION OF THE INVENTION

I. System Architecture 

Shown in FIG. 1 are the basic architectural compo- 
nents of the data storage system of the present inven- 
tion. The system includes configuration manager 102, 
allocation manager 104, power manager 106, access 
manager 108, disk array 110, global volume table of 
contents (GVTOC) 116, and data/command bus 118. 
Configuration manager 102 itself includes configuration 
map 112 and cluster map 114. Disk array 110 includes a 
plurality of individual disk files, shown as variously 
cross-hatched squares. In practice, hundreds or even 
thousands of disk files may be compactly packaged in 
disk array 110. 

Prior to operation, disk array 110 is configured into 
subsets called clusters. The configuration process is 
carried out under control of configuration management 
subsystem (configuration manager) 102, and may be 
initiated by information provided during the manufac- 
turing process or alternatively by the user in order to 
customize the system to specific requirements. In either 
event, configuration manager 102 accepts data indica- 
tive of what fraction of the available storage will be 
devoted to highly redundant mirrored data, what frac- 
tion will be devoted to RAID-5 data, what fraction will 
be used for unprotected data, etc., together with system 
constraints indicative of thermal loading characteristics 
of disk array 110, and determines the appropriate num- 
ber of clusters and cluster sizes to create the desired 
mix, as well as an optimal mapping of disk array 110 into 
the clusters. The mapping is generally calculated to 
disperse the disk files in a given cluster such that all files 
in the cluster can be active simultaneously without 
creating a localized thermal overload situation, or "hot 
spot", in disk array 110. A small exemplary portion of 
such a mapping is shown in FIG. 1, where the disk files 
in cluster 1 are dispersed from one another, as are those 
in cluster 2 and cluster 3, etc. Once configuration man- 
ager 102 has completed the mapping, it stores the accu- 
mulated data in configuration map 112 and cluster map 
114, as will be discussed in more detail below. 
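
The dispersal idea in the preceding paragraph can be illustrated with a minimal sketch. It assumes a flat two-dimensional grid of disk files and a simple interleaving rule; the function names, the grid shape, and the rule itself are hypothetical and are not taken from the specification, which leaves the actual mapping to the configuration manager.

```python
from itertools import combinations


def map_clusters(rows, cols, k):
    """Return {cluster_number: [(row, col), ...]} for a rows x cols grid.

    Hypothetical rule: the file at (r, c) joins cluster
    (r % k) * k + (c % k) + 1, so members of one cluster sit
    k slots apart in each direction.
    """
    clusters = {}
    for r in range(rows):
        for c in range(cols):
            number = (r % k) * k + (c % k) + 1
            clusters.setdefault(number, []).append((r, c))
    return clusters


def min_spacing(files):
    """Smallest Chebyshev distance between any two files of one cluster."""
    return min(max(abs(r1 - r2), abs(c1 - c2))
               for (r1, c1), (r2, c2) in combinations(files, 2))


# An 8 x 8 grid split into 16 clusters of 4 widely spaced files each.
cluster_map = map_clusters(rows=8, cols=8, k=4)
```

With this rule no two files of the same cluster are neighbors, so activating a whole cluster spreads its heat across the grid rather than concentrating it in one region.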

The operation of the data storage system centers 
around power management subsystem (power man- 
ager) 106, which manages disk array 110 such that at 
any point in time some disk files are active and others 
are inactive, and further such that the disk files which 
are active are those determined to be the best suited to 
serving the read and write storage requests pending in 
the system at that time. As used herein, "active" refers 
to a disk file that is powered-on and ready to read or 
write data, while "inactive" refers to a disk file that is 
not presently ready to read or write because it is pow- 
ered-off or because it is in some intermediate low-power 
quiescent state, such as with its electronics active but its 
disk stopped. Power manager 106 sequences the various 
disk files in accordance with lists of cluster numbers 
received from allocation management subsystem (allo- 
cation manager) 104 and access management subsystem 
(access manager) 108. These lists are in turn generated 
in response to storage requests received from computer 
systems attached to data/command bus 118. 

More specifically, storage write requests are received 
by allocation manager 104. A write request typically 
includes a dataset along with commands identifying the 
minimum redundancy level at which the dataset may be 
stored. As used herein, "dataset" refers to any block of 
data which is the subject of a storage request; a dataset 
may be identified by starting point and length, by name, 
or by any other suitable means. The minimum redun- 
dancy level may be unprotected for data whose loss can 
be tolerated, mirrored for critical data, RAID-5 for data 
of intermediate importance, or any other protection 
scheme identified to the data storage system. The allo- 
cation manager determines which cluster or clusters in 
the system are best suited to storing the dataset. Factors 
considered by the allocation manager may include the 
space available on various clusters, their redundancy 
level, the performance impact of making various clus- 
ters active or inactive, and the performance advantage 
of spreading a large dataset over multiple clusters, 
among others. In the preferred embodiment, the alloca- 
tion manager develops a list of optimal clusters using a 
linear cost function and a linear constraint function. 
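
As a rough illustration of how such a cost and constraint pair might behave, the sketch below scores each candidate cluster with a linear cost and filters candidates with two constraints (enough free space, at least the minimum redundancy). It is a simplification: it picks a single cluster by direct enumeration rather than solving a linear program, and the class, the protection ordering, and the weights `wake_cost` and `overprotect_cost` are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed ordering of protection levels, weakest to strongest.
LEVELS = {"unprotected": 0, "raid5": 1, "mirrored": 2}


@dataclass
class Cluster:
    number: int
    level: str
    free_mb: int
    active: bool


def allocation_cost(cluster, min_level, wake_cost=10.0, overprotect_cost=3.0):
    """Linear cost: a penalty for waking an inactive cluster plus a
    penalty for each protection level above the requested minimum."""
    cost = 0.0 if cluster.active else wake_cost
    cost += overprotect_cost * (LEVELS[cluster.level] - LEVELS[min_level])
    return cost


def choose_cluster(clusters, size_mb, min_level):
    """Return the cheapest cluster satisfying both constraints, or None."""
    feasible = [c for c in clusters
                if c.free_mb >= size_mb
                and LEVELS[c.level] >= LEVELS[min_level]]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: (allocation_cost(c, min_level), c.number))


example = [
    Cluster(1, "unprotected", 500, True),
    Cluster(2, "raid5", 500, False),
    Cluster(3, "mirrored", 500, True),
    Cluster(4, "raid5", 50, True),
]
best = choose_cluster(example, size_mb=100, min_level="raid5")
```

In this example the active mirrored cluster wins over the inactive RAID-5 cluster because, with the assumed weights, waking a cluster costs more than over-protecting the data by one level.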

The allocation manager passes the cluster list to the 
power manager, whose function it is to make the clus- 
ters active so that the dataset can be stored. The power 
manager determines which physical disk files must be 
active to fulfill the storage request by referring to clus- 
ter map 114. It then combines these files with those 
from other pending storage requests and determines 
which disk files to keep active, which to activate, and 
which to deactivate. Factors considered by the power 
manager may include the total allowable power con- 
sumption for the disk array, the maximum allowable 
local thermal loading within portions of the disk array, 
the power savings that can be achieved by activating 
fewer disk files than would be allowable based on 
power and thermal loading requirements, and the time 
required to activate and deactivate a disk file, among oth- 
ers. In the preferred embodiment, the power manager 
develops a linear cost function and a linear constraint 
function whose combined solutions determine the opti- 
mal set of disk files to be active at any point in time. The 
data is staged to the various disk files as they are acti- 
vated and deactivated, and the storage write operation 
is complete. 
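
The selection performed by the power manager can be sketched as a small 0/1 optimization. The example below is an assumption-laden stand-in: it enumerates subsets by brute force instead of using a linear programming solver, it works at the cluster level only, and the names `desirability`, `max_active`, `thermal_groups`, and `group_limit` are illustrative rather than drawn from the specification.

```python
from itertools import combinations


def select_active(desirability, max_active, thermal_groups, group_limit):
    """Pick the set of clusters to keep active.

    desirability: {cluster_number: weight}; higher means more wanted.
    max_active: most clusters the total power budget allows at once.
    thermal_groups: lists of cluster numbers that share a thermal zone.
    group_limit: most clusters allowed active inside one zone.
    """
    best, best_score = frozenset(), 0.0
    numbers = sorted(desirability)
    for size in range(1, max_active + 1):
        for combo in combinations(numbers, size):
            chosen = set(combo)
            # Skip any candidate set that overloads a thermal zone.
            if any(len(chosen & set(g)) > group_limit
                   for g in thermal_groups):
                continue
            score = sum(desirability[n] for n in chosen)
            if score > best_score:
                best, best_score = frozenset(chosen), score
    return best


active = select_active(
    desirability={1: 5, 2: 4, 3: 3, 4: 1},
    max_active=2,
    thermal_groups=[[1, 2], [3, 4]],
    group_limit=1,
)
```

Clusters 1 and 2 are the two most wanted but share a thermal zone, so the sketch settles on clusters 1 and 3. Brute force is only practical for a handful of clusters; a real array of thousands would need a proper solver or a heuristic.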

Storage read requests are received by access manager 
108. A read request simply identifies a dataset to be 
extracted from disk array 110. Access manager 108 
determines which cluster or clusters contain the re- 
quested dataset, and passes the cluster list to power 
manager 106. Power manager 106 determines which 
disk files must be activated to fulfill the request, and 
adds them to its cost and constraint functions as de- 
scribed above. As the data becomes available, it is as- 
sembled into the requested dataset. Upon completion 
the dataset is provided to the requesting computer sys- 
tem over data/command bus 118, and the storage read 
operation is complete. 
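
The two lookups in the read path can be sketched with plain dictionaries. The layout is an assumption: the GVTOC is modeled as a mapping from dataset name to cluster numbers and the cluster map as a mapping from cluster number to disk file identifiers; neither structure is specified at this level of detail in the text.

```python
def clusters_for_dataset(gvtoc, dataset_name):
    """Access-manager step: list the clusters holding the dataset."""
    return gvtoc[dataset_name]


def disk_files_to_activate(cluster_map, cluster_numbers, active_files):
    """Power-manager step: disk files that still need to be switched on."""
    needed = set()
    for number in cluster_numbers:
        needed.update(cluster_map[number])
    return sorted(needed - set(active_files))


# Hypothetical tables for a tiny array.
gvtoc = {"payroll": [2, 5]}
cluster_map = {2: ["d03", "d11"], 5: ["d07", "d19"]}

wanted = clusters_for_dataset(gvtoc, "payroll")
to_wake = disk_files_to_activate(cluster_map, wanted, active_files=["d03"])
```

Only the files not already active are returned, which mirrors the point that the power manager folds each new request into the set of files it is already keeping active.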

II. System Configuration 

The data storage system of the present invention 
includes a disk array containing a substantial number 
(hundreds or thousands) of individual small form factor 
(1.8-inch, 2.5-inch, 3.5-inch, etc.) disk files organized into logical clus- 
ters of disk files. A cluster may contain one disk file or 
dozens of disk files, depending upon the system configu- 
ration provided for a given user installation. The cluster 
configuration itself, which defines the clusters, may be provided by
the manufacturer of the data storage system or alternatively may be
customized by the user to suit specific requirements.

Shown in FIG. 2 is a block diagram representing the system
configuration process for the present invention. Configuration manager
102 receives configuration commands from the user, or in the case of a
default configuration provided by the manufacturer, from a
configuration dataset. The configuration commands specify the amount
of storage capacity by quantity, percentage, or any other suitable
measure, that is to be allocated at various levels of redundancy. This
may include unprotected storage, mirrored storage, RAID level 5
storage, RAID level 3 storage, etc. The configuration commands may
also specify the desired excess capacity as well as the desired access
rate for a particular block of storage, such as high speed access, low
speed access, etc. Configuration manager 102 combines the required
capacity, excess capacity, and redundancy information to determine the
number of disk files required to satisfy the configuration command. It
assigns a cluster number to the command, and then allocates physical
disk files in the storage array to the cluster. The files are
allocated according to the system's thermal load constraints, which
causes them to be physically dispersed from one another such that
localized hot spots will not occur when the cluster is made active.
One possible implementation is to assign clusters using a simple
regular algorithm, for example files 1 to n assigned to cluster 1,
files n+1 to 2n assigned to cluster 2, etc., provided the selected
assignment meets the thermal constraints such that all files in the
cluster can be active concurrently. Other assignment options include
square clusters or cubic clusters.

A linear, square, or cubic assignment as described above may
unnecessarily limit the number of clusters that can be active at one
time. In the preferred embodiment, clusters are assigned using a
mathematical programming function which creates the required
dispersion in accordance with principles to be discussed in more
detail below. Briefly, constraints are identified to describe the
minimum allowable distances between disk files within a cluster and in
adjoining clusters. An optimizing function is developed which
maximizes the dispersion of disk files, consistent with the bounds of
the array and the foregoing constraints.

The difficulty of the assignment process can range from nearly trivial
to very complex, particularly if the number of clusters that can be
active based on local thermal loading constraints is significantly
less than the number that can be active based on total power
consumption constraints. For example, if a cube of 8 x 8 x 8 disk
files is known to create a hot spot, a constraint would be established
such that 64 adjacent clusters (8 x 8 x 8) could never be active at
the same time. If the power limit for the entire array is less than 64
clusters, a hot spot will never occur, regardless of cluster
selection. But if the power limit for the array is significantly
greater than 64 clusters, it becomes preferable to spread the
individual disk files associated with a given cluster over a larger
area so that any 63 clusters may be active without creating a hot
spot. As a further complication to the configuration process,
different clusters may contain different numbers of disk files. For
instance, mirrored clusters may contain a different number of files
from RAID clusters, which in turn may contain a different number of
files from unprotected clusters.

Once the clusters and disk files are assigned, the correspondence
between configuration commands and assigned cluster numbers is stored
in configuration map 112. Likewise, the correspondence between cluster
numbers and physical disk files is stored in cluster map 114. The
contents of configuration map 112 are provided to allocation manager
104, which uses this information during operation to determine cluster
numbers based on the redundancy requirements for datasets received
from attached computer systems. The contents of cluster map 114 are
provided to power manager 106, which uses this information during
operation to determine physical disk files based on cluster numbers.

III. Application of Linear Programming to Configuration, Power, and
Allocation Management

In mapping the disk array into clusters, the configuration manager
seeks to spread clusters and the individual disk files in the clusters
so that in operation the maximum possible number of clusters can be
active simultaneously, consistent with the total power limit and local
thermal loading constraints for the storage system. Once the disk
array has been mapped into clusters, the power manager is responsible
for controlling the power status of each cluster so that, at any given
point in time, the total power consumption of the disk array is within
allowable limits and the localized thermal load is within allowable
limits. In accordance with the preferred embodiment of the present
invention, a linear programming approach is used by the configuration
manager and the power manager to determine the optimal organization
for the [...] and the optimal power [...] for each cluster in the disk
array. Constraints corresponding to physical requirements (such as
total power consumption and local thermal loading) are identified and
quantified, and used by the configuration manager to develop the [...]
cluster map. In operation, these same constraints are combined with
data availability costs associated with the active/inactive state of
clusters to determine, at any point in time, the optimal set of
clusters to be made active.

[...] clusters, [...] system [...] F(i) = 0 for cluster i when that
cluster is inactive and F(i) = 1 for cluster i when that cluster is
active,

    F(1) + F(2) + . . . + F(i) + . . . + F(n) ≤ m              (1)

Additionally, if D(i) is defined as the desirability of the ith
cluster being active, then the optimal set of active clusters is
given by

    F(1) x D(1) + F(2) x D(2) + . . . + F(n) x D(n) = maximum  (2)

Thus, the configuration manager must devise a cluster mapping and the
power manager must select the set of active clusters such that (1) and
(2) are both satisfied.

However, even if the total system power is constrained in conformance
with (1), localized "hot spots" may result if numerous disk files are
active in a concentrated region of the disk array. Additional
constraints may be added to address this exposure. The exact nature of
these constraints depends upon the cluster configuration of the array.
For example, if the clusters are arranged such that within any set of
ten clusters thermal loading constraints permit only two clusters to
be active
simultaneously, then the additional constraints defined are:

    F(1) + F(2) + F(3) + . . . + F(10) ≤ 2

    F(11) + F(12) + F(13) + . . . + F(20) ≤ 2

    F(n-9) + F(n-8) + F(n-7) + . . . + F(n) ≤ 2

In the above case, both the configuration manager and the power
manager handle local thermal loading at the cluster level. In a more
complex embodiment in which cluster size varies or the disk files
comprising the various clusters are distributed unevenly throughout
the disk array, the configuration manager and power manager may be
required to consider the detailed location of each disk file in the
cluster to handle local thermal loading constraints. In this case the
constraint functions are constructed with reference to disk files
rather than cluster numbers. Then the configuration manager and the
power manager convert cluster numbers to disk files before solving the
constraint functions. Otherwise operations proceed the same as where
configuration management/power management is conducted at the cluster
level.

Costs are assigned at configuration time in accordance with the
configuration selected by the system user. As used herein, "cost"
refers to the desired level of availability associated with a
particular cluster. A low cost is attributed to clusters requiring
high-speed access and to clusters for which frequent access is
expected. This latter group may include clusters containing pointers
to data in other clusters, clusters containing global
volume-table-of-contents (GVTOC) information for the entire disk
array, and clusters containing directory information for an operating
system running on a computer [...] which a lower access rate can be
tolerated. This may include clusters storing [...] infrequently
referenced [...]

During operation of the data storage system, the costs associated with
various clusters may be dynamically modified by request of the
allocation manager or the storage manager. Such requests are placed in
response to actual storage and retrieval frequency experienced in
specific clusters, as well as anticipated requirements.

In addition to the constraints and costs governing the function of the
configuration manager and the power manager, a separate set of
constraints and costs are applied by the allocation manager to
determine the optimal cluster or clusters upon which to store incoming
data. A constraint is established for each cluster representing the
excess data capacity remaining in that cluster. A second constraint,
called a "striping constraint," is added if it is desired to force the
dataset to be spread across multiple clusters in accordance with
redundancy techniques well known in the RAID art. In general, striping
of this sort also tends to reduce system overhead and improve
performance by enabling the storage system to begin transferring data
to/from active clusters while simultaneously activating other clusters
whose use is required to complete the storage read/write request. The
effect is enhanced further if a portion of each striped dataset is
allocated to a normally-active cluster (a cluster with a low cost
function assigned to it), since this virtually guarantees that data
transfer can begin immediately upon receipt of the storage read/write
request.

Having established its constraints, the allocation manager proceeds to
construct a cost function which represents the overhead (time and
storage space) involved in storing the data in a cluster having more
than the required level of redundancy, the expense involved in making
an inactive disk file active, and the expense of storing data in a
cluster having more than the required level of availability. The
equation set representing the costs and constraints is solved by the
allocation manager to produce an optimal allocation of the data set
over the available capacity. The resulting cluster numbers are passed
to the power manager so that the appropriate disk files can be made
active and the data recorded.

It is to be noted that while the preferred embodiment of the present
invention uses linear programming concepts to provide an optimal set
of available disk files, other techniques may be readily substituted,
including heuristics and statistical tracking. Additionally, any
number of constraints and costs may be added to those discussed above
in determining the optimal availability mix for a given installation
and the optimal spread of data across the available capacity. For
example, a constraint may be used to further limit the density of
available disk files in regions of the storage array known to receive
low thermal cooling; another constraint may be used to lower total
power consumption during certain times of the day or week when overall
usage may be particularly low or utility rates particularly strained.
A cost function may be added to increase the cost of larger [...] may
be used to lower the cost of particular [...] such as clusters
containing users' personal storage during business [...]

IV. Data Flow

Shown in FIG. 3 is a block diagram representing the data flow [...]
normal operation, a data write sequence begins when allocation manager
104 receives a write command that includes a size identifier and a
[...] class identifier. The size identifier indicates the amount of
data to be stored. The class identifier indicates the minimum required
redundancy level for the data, such as RAID level 3. Allocation
manager 104 searches configuration map 112 for one or more clusters
having the appropriate redundancy level, and then checks GVTOC 116 to
determine available capacity. If sufficient storage cannot be located
at the minimum required redundancy level, clusters at the next more
secure redundancy level are checked until sufficient space is found
for the data. Allocation manager 104 also applies cost and constraint
functions to the available clusters to determine the optimal one(s) to
fulfill the write request based on availability, capacity, etc., as
described above.

The determined cluster identifier(s) are then passed to power manager
106. Power manager 106 searches cluster map 114 to determine the
physical disk files corresponding to the cluster number(s). Power
manager 106 then calculates its cost and constraint functions for
identified physical disks 410, and proceeds to make them available for
use, by adjusting costs if necessary, in accordance with the resulting
solution to the optimiza-
tion problem. Clusters that are not needed are assigned a high cost,
and clusters that are required are assigned a low cost. These costs
can be modified as necessary for aging or high priority requests.
Power manager 106 maintains a record of the cluster sequencing
corresponding to each request so that clusters can be brought on line
in anticipation of their use by dynamically adjusting their cost
functions. Finally, when the clusters are available, the data is
transmitted to the disk file(s), and GVTOC 116 is updated to record
the location of the data.

A data read sequence begins when access manager 108 receives a read
command that identifies a dataset stored in the disk array. Access
manager 108 searches GVTOC 116 to determine the cluster(s) on which
the data is located, and then provides this information to power
manager 106. Power manager 106 searches cluster map 114 to determine
the physical disk files corresponding to the cluster number(s). Power
manager 106 then adjusts the cost functions for the identified
physical disks and proceeds to make them available for use in
accordance with the resulting solution to the optimization problem.
Finally, when the cluster(s) are available the data is read from the
disk files and provided to the requesting computer system.

As is shown in FIG. 3, in the preferred embodiment data is transmitted
through the power manager on its way to and from the disk array,
without passing through the allocation manager. However, numerous
alternatives are available which would not represent a departure from
the spirit and scope of the present invention. For instance, data
accompanying a write request could be staged through the allocation
manager, then directly into the disk array upon signal from the power
manager. Likewise, data for a read request could be staged through the
access manager. Or, an additional caching storage unit could be added
to the subsystem, and all data staged into the cache, with a direct
link between the cache and the disk array. Or, if an implementation
required, all write data could be routed through both the power
manager and the allocation manager, while all read data could be
routed through both the power manager and the access manager.

V. Exemplary Low Power Configuration

Shown in FIG. 4 is a block diagram representing a simple application
of the present invention to a low-power disk array. Such a
configuration may be applied to a portable computer system, such as a
laptop or other personal computer having a constrained power supply,

[...]

two disk files can be active at a time. For illustrative purposes, it
is further assumed that at the time of the snapshot an incoming
dataset of 400 M-bytes is to be stored in the array, that the user has
chosen to store data in at least 100 M-byte blocks, and that clusters
will not be filled to 100% of capacity if an alternative exists. With
only clusters 2 and 8 active, there is not sufficient capacity to
store the entire dataset in active files, so 100 M-bytes is stored in
cluster 8 while cluster 2 is being deactivated. Since cluster 5 has
sufficient capacity (hence lowest cost), cluster 5 is activated while
the first 100 M-bytes is being transferred to cluster 8. Once the
first 100 M-bytes is transferred to cluster 8, the remaining data is
stored in cluster 5. Finally, to complete the storage write request
the GVTOC is updated to indicate the cluster locations chosen for the
dataset.

VI. Exemplary High Capacity Configuration

Shown in FIG. 6 is a block diagram representing a data storage system
which includes a very high capacity disk array. The disk array
contains a total of 10,912 disk files, each with a capacity of 5
G-bytes, for an aggregate capacity of 54.56 T-bytes. The user or
manufacturer of the exemplary storage system has determined that a
desirable configuration is: 15 T-bytes of unprotected data, 15 T-bytes
of RAID-3, 15 T-bytes of RAID-5, and the remainder (4.78 T-bytes)
mirrored data. This information is provided to the configuration
manager at configuration time. The configuration manager determines
the number of clusters needed to achieve the requested configuration
based on the capacities of the disk files in the disk array and the
optimal cluster sizes for the various levels of redundancy. In this
example, each cluster is simply assigned the same number of disk
files, eight. The result is 375 clusters of unprotected data, 375
clusters of RAID-3 data, 375 clusters of RAID-5 data, and 119 clusters
of mirrored data. This information forms the substance of the
configuration map.

Next, the configuration manager spreads the clusters across the disk
array in accordance with the predetermined constraints and costs. In
the present example, it is assumed the local thermal loading
constraint allows no more than 512 adjacent disk files to be active
simultaneously, and further that the total thermal loading constraint
allows no more than 1024 files to be active throughout the entire
array at any one time. Based on these constraints, many different
cluster layouts could be constructed. For the chosen cluster size of 8
disk files, the approach selected assigns the physical disk
to achieve high storage capacity and availability with 50 files for each cluster to occupy a row in array 610. Thus, 
low power consumption. The arrows indicate direction cluster 1 consists of the eight disk files at cartesian coor- 
of flow of array management information. Disk array dinates y=l, z=l, and x=l to 8. Similarly, cluster 12 
402 is divided into 8 clusters labelled 411-418, one clus- consists of the eight disk files at coordinates y= 1, 
ter occupying each disk file. To conserve power, a z=12, and x=l to 8. The complete physical disk file 
constraint is established such that at most two clusters 55 mapping for all the clusters is shown in FIG. 7. This 
arc active at a time. No thermal loading constraints arc information forms the substance of the cluster map. 
required in an array of this size. The allocation man- Finally, based on this information the configuration 
ager, access manager, and power manager handle stor- process is completed when the configuration manager 
age read and write requests as described above, se- loads the configuration data into the configuration map 
quencing power among the 8 clusters such that all appli- 60 and the cluster data into the cluster map as described 
cable constraints are satisfied and overall storage cost is previously. 

minimized. A snapshot of the first 24 clusters of the data storage 

A snapshot of the low power disk array of FIG. 4 system of FIG. 6 during operation is shown in the table 

during operation is shown in the table of FIG. 5. The of FIG. 8. The table is organized by cluster number, 

table is organized by cluster number, maximum capac- 65 The information shown for each cluster includes redun- 

ity, available capacity and cluster state (e.g. active— on, dancy level, maximum capacity, unused capacity, active 

inactive— off). It is assumed that the system is powered status, number or disk files in the cluster, and physical 

by batteries, and hence is constrained such that at most positions of the disk files in the disk array. It is to be 



noted that this information need not be accumulated in a
single table in the data storage system, but may instead be
subdivided and maintained according to any reasonable schema.
All relevant information is shown compressed into FIG. 8
simply for ease of presentation. Exemplary of the information
contained in FIG. 8 is cluster 12. The redundancy level for
cluster 12 is mirrored storage having a total capacity of
20 G-bytes; 10 G-bytes are available for new data. The
cluster is inactive at the time of the snapshot. It includes
8 physical disk files, located in the disk array at Cartesian
coordinates y=12, x=1 to 8, and z=1.

To avoid excessive complexity, it will be assumed that
clusters 25 and above are full and hence unavailable for data
storage. For illustrative purposes, it is further assumed
that at the time of the snapshot an incoming dataset is
received having 700 Megabytes of data, and that the dataset
is targeted for RAID-3 storage. The allocation manager
determines the appropriate cluster(s) for the dataset using
the configuration map with cost and constraint functions
determined according to the principles discussed above. Full
clusters and clusters at too low a redundancy level are not
considered. Clusters which contain critical information are
always assigned low costs so that they will tend to be kept
active. Higher redundancy levels are assigned higher costs to
reflect the higher overhead associated with storing data at
those levels. Active clusters are assigned lower costs than
inactive ones to enhance overall storage speed. Optionally,
the cost of recovery for various redundancy levels may be
factored in. Also, constraints may be used to require file
spreading across several clusters, possibly with the first
portion of the file assigned to a usually-active cluster.

In the present example, cluster 21 is not considered for use
in satisfying the storage request since it is already full.
Assuming cluster 1 to contain critical-availability system
data, it is assigned a very low cost, such as 1. The mirrored
clusters generally are assigned a high cost, such as 100, to
reflect the expense of mirrored storage. Unprotected
clusters, which use half the storage space of mirrored
clusters, are assigned a cost of 50. RAID-3 storage is more
expensive than unprotected storage but less expensive than
mirrored storage; accordingly, it is assigned a cost of 60.
For RAID-5 storage a cost of 75 is chosen to reflect the
relative expense of storage at this level. Additionally, the
cost of each inactive cluster is increased by 20 to reflect
the performance cost of activating disk files. Finally, a
constraint is established to take advantage of the high
activity of cluster 1: 10 M-bytes of the incoming file will
be stored on cluster 1 and up to 100 M-bytes each will be
stored on other clusters until enough space is assembled to
contain the entire 700 M-byte dataset. These costs and
constraints are merely illustrative; others may be chosen
readily by either the designer or the user of the data
storage system in accordance with the types of data present,
the mix of redundancy levels, and the desired performance.

After eliminating those clusters shown in FIG. 8 that are
either full or at a lower redundancy level than RAID-3, the
allocation manager develops the following optimization and
constraint equations:

1A(1) + 95A(2) + 80A(3) + 100A(6) + 75A(7) + 80A(8) +
120A(9) + 95A(10) + 60A(11) + 120A(12) + 75A(13) +
95A(15) + 80A(16) + 95A(17) + 120A(19) + 120A(22) +
80A(23) + 80A(24) = min

[constraint equation illegible in the source]

where A(i) is a minimum of 100 M-bytes or the available
capacity of cluster i. The allocation manager solves these
equations to arrive at the following optimal distribution of
data: 10 M-bytes on cluster 1, 100 M-bytes on cluster 11,
100 M-bytes each on clusters 7, 8, 13, 16, and 23, and
90 M-bytes on cluster 3.

The allocation manager passes the above-determined cluster
list to the power manager, which factors this information
into its own cost and constraint equations. In the present
example, it is assumed that the system is subject to the
[illegible] constraints that at most 4 of the clusters in the
range from 1 to 24 may be active at one time and that
adjacent clusters may not be active simultaneously. This
yields the constraints:

F(1)+F(2)+F(3)+F(4)+ . . . +F(24) ≤ 4

and

F(1)+F(2)<2
F(2)+F(3)<2
F(3)+F(4)<2
. . .
F(23)+F(24)<2

The cost optimizing function for the power manager is a
linear combination of the requirements from the allocation
manager (clusters 1, 3, 7, 8, 11, 13, 16, 23) and other
pending requests from the allocation manager and the access
manager. Assuming for purposes of the present example that
there are no other pending requests, the power manager
determines its cost function to be:

100F(3)+100F(7)+100F(8)+100F(11)+100F(13)+100F(16)+100F(23)=minimum

Referring again to the snapshot status shown in FIG. 8, it is
apparent that cluster 6 is to be deactivated since it is not
required to fulfill the storage request. Cluster 1, which is
already active, will be kept active. Similarly with clusters
7 and 11 except that they will be deactivated after their
data allocations since their costs are higher. Clusters 3, 8,
16, 23, and 13 will be made active as permitted by the power
and thermal constraints. Since the power constraint allows up
to 4 active clusters at a time, the power manager activates
cluster 3 while data is being stored on clusters 1, 7, and
11. Once the first 110 M-bytes is stored on clusters 1 and
11, cluster 7 is made [illegible] and cluster 8 is activated.
The process continues, with clusters being activated and
deactivated as data is stored, until completed. The staging
of disk files enables the continuous transfer of data to
active files while others are being deactivated [illegible]
does not affect the performance of the data storage system.
Moreover, since cluster 1 is always active there is no lag
time for initial start-up of a dataset transfer,
even if none of the other clusters chosen by the alloca- 
tion manager are initially active. Finally the GVTOC is 
updated to show the location of the 700 M-byte dataset 
in the disk array, and the process is complete. 
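
The write example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented optimizer: it assumes every candidate cluster has at least 100 M-bytes free (the capacities of FIG. 8 are not reproduced here), it takes the per-cluster costs from the terms of the allocation manager's cost equation (the term for cluster 6 is inferred), and it uses a simple greedy fill in ascending cost order, which suffices for a linear cost with per-cluster caps. The names `COST`, `allocate`, and `power_ok` are hypothetical.

```python
# Illustrative per-M-byte costs for the candidate clusters of the example.
# Cluster 6 (assumed mirrored and active, cost 100) is an inferred term.
COST = {1: 1, 2: 95, 3: 80, 6: 100, 7: 75, 8: 80, 9: 120, 10: 95,
        11: 60, 12: 120, 13: 75, 15: 95, 16: 80, 17: 95, 19: 120,
        22: 120, 23: 80, 24: 80}


def allocate(size_mb, cost=COST, first_cluster=1, first_mb=10, cap_mb=100):
    """Place first_mb on the always-active cluster, then fill the
    remaining clusters cheapest first, at most cap_mb per cluster.

    Assumption: each candidate cluster has at least cap_mb available.
    """
    plan = {first_cluster: first_mb}
    remaining = size_mb - first_mb
    candidates = sorted((c for c in cost if c != first_cluster),
                        key=lambda c: (cost[c], c))
    for cluster in candidates:
        if remaining == 0:
            break
        share = min(cap_mb, remaining)
        plan[cluster] = share
        remaining -= share
    if remaining:
        raise ValueError("not enough candidate clusters for the dataset")
    return plan


def power_ok(active, max_active=4):
    """Check the example's two power manager constraints: at most
    max_active clusters on at once, and no two adjacent clusters on."""
    active = set(active)
    no_neighbours = not any(c + 1 in active for c in active)
    return len(active) <= max_active and no_neighbours


plan = allocate(700)
total_cost = sum(COST[c] * mb for c, mb in plan.items())
```

Under these assumptions the sketch selects the same eight clusters as the example (1, 3, 7, 8, 11, 13, 16, and 23). Because clusters 3, 8, 16, and 23 share the same cost, the split of the last 390 M-bytes among them depends only on tie-breaking, so the 90 M-byte remainder may land on a different cluster than in the text without changing the total cost. `power_ok` can then be used to check any proposed set of simultaneously active clusters, for example `power_ok({1, 7, 8})` fails because clusters 7 and 8 are adjacent.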

In conjunction with processing the above-described 
write request, the data storage system of the present 
invention may also receive and process additional write 
requests and/or read requests. Additional write requests 
would be handled as described above. Read requests are 
handled by the access manager, which locates the re- 
quested dataset by searching for its name in the 
GVTOC, and then passes the appropriate cluster num- 
bers to the power manager so that the clusters can be 
activated and the data extracted. Thus, continuing with 
the exemplary snapshot of FIG. 8, it is assumed a read 
request is received for a 410 M-byte dataset. The dataset 
identifier is cross referenced into the system GVTOC, 
which reveals that the dataset is stored on clusters
1, 4, 5, 18, and 20. This list is provided to the power
manager, which constructs cost and constraint equations. If
there are no other requests from the access manager or
the allocation manager, the power manager would
choose to activate clusters based on the equation

F(1)+F(4)+F(5)+F(18)+F(20)=max

while maintaining the thermal constraints. Assuming
the same initial state as FIG. 8, cluster 1 is active but
clusters 4, 5, 18, and 20 are inactive. Data is retrieved
from cluster 1, while clusters 6, 7, 11, and 13 are being
deactivated, and clusters 4, 5, 18, and 20 are being
activated. Since clusters 4 and 5 are adjacent they cannot be
activated simultaneously, so clusters 4, 18, and 20 are
initially activated. After the data has been read from
cluster 4, cluster 4 is deactivated, then cluster 5 is
activated and its data retrieved. Since the data may not
become available in the order in which it was stored, it
is staged to a cache separately associated with the data
storage system, and assembled into the complete dataset
before being supplied to the requesting computer system,
all in accordance with techniques well known in
the disk controller art.

VII. Conclusion

It is to be noted that while the invention has been
described in the context of disk arrays, it is readily
applicable to any densely packed electronics system in which
it is necessary or desirable to manage power and thermal
loading. For instance, random access memory modules
having densely packed memory chips may benefit
from application of the present invention. Vector
processors and parallel computers having many closely
spaced circuit boards are additional candidates.

Also, while the invention has been particularly
described and illustrated with reference to a preferred
embodiment, it will be understood by those skilled in
the art that changes in the description or illustrations
may be made with respect to form or detail without
departing from the scope of the invention. Accordingly,
the present invention is to be considered as encompassing
all modifications and variations coming within the
scope defined by the following claims.

What is claimed is:

1. A power management subsystem for controlling
power status of a plurality of physical disk files in a disk
array, said power management subsystem comprising:
an input for receiving an identified cluster number
representing a logical cluster of disk files to be
activated and for receiving mapping information
for identifying the physical disk files allocated in
the disk array to the logical cluster;
an output addressably connected to power controls of
the physical disk files; and
an array power optimizer, coupled to the input and
the output, for activating and deactivating the
physical disk files in the array in accordance with a
constraint function.

2. The power management subsystem as recited in
claim 1, wherein the constraint function includes a
constraint for representing maximum allowable local
thermal load within the disk array.

3. The power management subsystem as recited in
claim 1, wherein the constraint function includes a
constraint for representing maximum allowable power load
for the disk array.

4. The power management subsystem as recited in
claim 1, wherein the array power optimizer permits
only a subset of the disk files in the array to be active at
the same time.

5. The power management subsystem as recited in
claim 1, wherein the array power optimizer further
determines the power status of the physical disk files in
accordance with a cost function.

6. The power management subsystem as recited in 
claim 5, wherein the cost function includes a cost estab- 
lished to maintain selected clusters in a normally-active 
state. 

7. The power management subsystem as recited in 
claim 5, wherein the cost function includes a cost for 
representing a time required to activate a cluster in 
anticipation of its use. 

8. A data storage and retrieval system for storing data
on a plurality of physical disk files, said data storage and
retrieval system comprising:
a cluster map having entries for identifying logical 
clusters of disk files and the physical disk files allo- 
cated in the storage and retrieval system to the 
logical clusters; 
an allocation manager for assigning logical clusters in 
which to store data, the allocation manager having 
an input for receiving a write command identifying 
data to be stored in the data storage and retrieval 
system and an output for providing a cluster num- 
ber identifying a cluster on which the data is to be 
stored; 

a power manager for controlling power status of the 
physical disk files, the power manager having an 
input for receiving the identified cluster number 
and mapping information for identifying the physi- 
cal disk files allocated in the storage and retrieval 
system to the identified cluster number, and an 
output addressably connected to power controls of
the physical disk files; and 
an access manager for identifying logical clusters in 
which requested data is stored, the access manager 
having an input for receiving a storage request 
from a data processing device and an output for 
providing a cluster number identifying a cluster on 
which the requested data is stored. 

9. The data storage and retrieval system as recited in 
claim 8, wherein the allocation manager assigns clusters 
in accordance with a constraint function and a cost 
function, and the power manager controls power status 
in accordance with a constraint function and a cost 
function.

10. The data storage and retrieval system as recited in
claim 9, further comprising:

a configuration manager for defining logical clusters
of disk files in accordance with received configura-
tion commands and for defining sets of physical
disk files allocated in the data storage and retrieval
system to the logical clusters, the physical disk files
being spatially dispersed from one another accord-
ing to a constraint function;

a configuration map for receiving the mapping infor-
mation identifying configuration commands and
the assigned logical clusters identified by the con-
figuration commands; and

wherein the cluster map receives the mapping infor-
mation identifying the logical clusters and the
physical disk files allocated in the data storage and
retrieval system to the logical clusters.

11. A data storage and retrieval system, comprising:
a plurality of data recording disk files;

a configuration manager for defining logical clusters
of disk files in accordance with received configura-
tion commands and for defining sets of physical
disk files allocated in the data storage and retrieval
system to the logical clusters, the physical disk files
being spatially dispersed from one another accord-
ing to a constraint function;

a configuration map for receiving mapping informa-
tion identifying the configuration commands and
the assigned logical clusters identified by the con-
figuration commands;

a cluster map for receiving the mapping information
identifying the logical clusters and the physical
disk files allocated in the data storage and retrieval
system to the logical clusters;

an allocation manager for assigning logical clusters in 
which to store data, the allocation manager having 
an input for receiving a write command identifying 
data to be stored in the data storage and retrieval 
system and an output for providing a cluster num- 
ber identifying a cluster on which the data is to be 
stored, the allocation manager assigning clusters in 
accordance with a constraint function and a cost 
function; 

a power manager for controlling power status of the 
physical disk files, the power manager having an 
input for receiving the identified cluster number 
and mapping information for identifying the physi- 
cal disk files allocated in the data storage and re- 
trieval system to the identified cluster number, and 
an output addressably connected to power controls
of the physical disk files, the power manager con- 
trolling power status in accordance with a con- 
straint function and a cost function; and 

an access manager for identifying logical clusters in 
which requested data is stored, the access manager 
having an input for receiving a storage request 
from a data processing device and an output for 
providing a cluster number identifying a cluster on 
which the requested data is stored. 